🧩 Visualization - Reusable Functions for Impact Analysis Part I¶

This section provides reusable, parameterized functions for analyzing and visualizing performance metrics across temporal and categorical dimensions. Designed for flexibility and clarity, the functions support:

  • Dynamic grouping by time (year, month, day) or category (store, promo, etc.)

  • Preprocessing filters to exclude non-operational records (e.g., closed stores, zero-sales days)

  • Statistical summaries including mean, standard deviation, and count

  • Ranked insights with volatility and performance differentials

  • Interactive visualizations via Plotly for enhanced interpretability

These tools enable scalable impact assessments and trend analyses across diverse datasets with minimal code repetition.
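As a quick illustration of the preprocessing filter mentioned above, here is a minimal sketch on a hypothetical two-column frame (the toy data is invented; the real dataset has many more columns):

```python
import pandas as pd

# Hypothetical mini-frame standing in for the full dataset
df = pd.DataFrame({"sales": [0, 100, 0, 200], "open": [0, 1, 0, 1]})

# Keep only operational records (open stores with non-zero sales)
open_days = df[df["sales"] > 0]
print(len(open_days))  # → 2
```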

1. Setup & Import Libraries¶


In [1]:
import time 
In [2]:
# Step 1: Setup & Import Libraries
print("Step 1: Setup and Import Libraries started...")
time.sleep(1)  # Simulate processing time
Step 1: Setup and Import Libraries started...
In [3]:
# Data Manipulation & Processing
import math
import numpy as np
import pandas as pd
import scipy.stats as stats
from datetime import datetime
from sklearn import preprocessing  # explicit import; wildcard imports are discouraged

# Data Visualization
import seaborn as sbn
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

import plotly.io as pio
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

from pandas.plotting import scatter_matrix

# to ensure Plotly works in both Jupyter and HTML export
pio.renderers.default = "notebook+plotly_mimetype"

sbn.set(rc={'figure.figsize':(14,6)})
plt.style.use('seaborn-v0_8')
sbn.set_palette("husl")

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)
pd.set_option('display.float_format','{:.2f}'.format)

# Warnings
import warnings
warnings.filterwarnings('ignore')
In [4]:
print("="*60)
print("Rossmann Store Sales Time Series Analysis - Part 2")
print("="*60)
print("All libraries imported successfully!")
print("Analysis Date:", pd.Timestamp.now().strftime('%Y-%m-%d %H:%M:%S'))
============================================================
Rossmann Store Sales Time Series Analysis - Part 2
============================================================
All libraries imported successfully!
Analysis Date: 2025-08-16 01:02:27
In [5]:
print("✅ Setup and Import Libraries completed.\n")
✅ Setup and Import Libraries completed.

In [6]:
# Start Impact Analysis

viz_impact_analysis_begin = pd.Timestamp.now()

bold_start = '\033[1m'
bold_end = '\033[0m'

print("🔍 Viz Impact Analysis Started ...")
print(f"🟢 Begin Date: {bold_start}{viz_impact_analysis_begin.strftime('%Y-%m-%d %H:%M:%S')}{bold_end}\n")
🔍 Viz Impact Analysis Started ...
🟢 Begin Date: 2025-08-16 01:02:27

Restore the file¶


In [7]:
%store -r df_viz_feat

View or Display Dataset¶

In [8]:
print("\nTrain Data Preview:")
print("\n",df_viz_feat.head())
Train Data Preview:

         store  dayofweek       date  sales  customers  open     promo stateholiday  schoolholiday  day  week month  quarter  year  isweekend  isholiday  isschoolDay
982643   1115          2 2013-01-01      0          0     0  No Promo       Public              1  Tue     1   Jan        1  2013      False       True        False
982640   1112          2 2013-01-01      0          0     0  No Promo       Public              1  Tue     1   Jan        1  2013      False       True        False
982639   1111          2 2013-01-01      0          0     0  No Promo       Public              1  Tue     1   Jan        1  2013      False       True        False
982638   1110          2 2013-01-01      0          0     0  No Promo       Public              1  Tue     1   Jan        1  2013      False       True        False
982637   1109          2 2013-01-01      0          0     0  No Promo       Public              1  Tue     1   Jan        1  2013      False       True        False
In [9]:
# Step 2: Feature Engineering
print("Step 2: Feature Engineering started...")
time.sleep(1)  # Simulate processing time
Step 2: Feature Engineering started...
In [10]:
# Make a copy of the original dataframe to avoid modifying it
df_features = df_viz_feat.copy()

🧩 Reusable Utility Functions for Analytical Workflows¶


Promotion Impact Analysis¶

In [11]:
def analyze_promotion_impact(df, target_col='sales', category_col='day', width=1200, height=500):
    """
    Analyze promotion impact across any category (day, month, store, etc.)
    
    Parameters:
    - df: DataFrame with 'promo' column
    - target_col: metric to analyze ('sales' or 'customers')
    - category_col: grouping category ('day', 'month', 'store', etc.)
    """
    # Filter out closed stores (e.g., sales = 0) for unbiased and fair comparison
    df = df[df[target_col] > 0].copy()
    
    # Create summary data
    summary = df.groupby([category_col, 'promo'])[target_col].mean().reset_index()
    
    # Visualization
    fig = px.bar(
        summary,
        x=category_col,
        y=target_col,
        color='promo',
        title=f'Promotion Impact: {target_col.title()} by {category_col.title()}',
        color_discrete_map={'Promo': '#636EFA', 'No Promo': '#EF553B'},
        barmode='group'
    )
    fig.update_layout(title_x=0.5, height=height, width=width)
    fig.show(config={'displayModeBar': True, 'displaylogo': False})
    
    # Impact analysis
    print(f"Promotion Impact Analysis - {target_col.title()} by {category_col.title()}:")
    print("=" * 60)
    
    categories = sorted(df[category_col].unique())
    for category in categories:
        no_promo = df[(df[category_col] == category) & (df['promo'] == 'No Promo')][target_col].mean()
        promo = df[(df[category_col] == category) & (df['promo'] == 'Promo')][target_col].mean()
        
        if not pd.isna(no_promo) and not pd.isna(promo):
            lift = ((promo - no_promo) / no_promo) * 100
            currency = "€" if target_col == 'sales' else ""
            print(f"{str(category):12}: No Promo {currency}{no_promo:6,.0f} | Promo {currency}{promo:6,.0f} | Lift {lift:+5.1f}%")
    
    # Overall summary
    overall_no_promo = df[df['promo'] == 'No Promo'][target_col].mean()
    overall_promo = df[df['promo'] == 'Promo'][target_col].mean()
    overall_lift = ((overall_promo - overall_no_promo) / overall_no_promo) * 100
    currency = "€" if target_col == 'sales' else ""
    
    print(f"\nOverall Impact:")
    print(f"Average lift from promotions: {overall_lift:+.1f}%")
    print(f"Additional revenue per day: {currency}{overall_promo - overall_no_promo:,.0f}")
    
    return summary
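The lift arithmetic inside `analyze_promotion_impact` can be sanity-checked on a tiny hypothetical frame (the numbers here are invented for illustration, not drawn from the Rossmann data):

```python
import pandas as pd

# Hypothetical mini-dataset: two days, promo vs. no-promo sales
df = pd.DataFrame({
    "day":   ["Mon", "Mon", "Tue", "Tue"],
    "promo": ["No Promo", "Promo", "No Promo", "Promo"],
    "sales": [100, 150, 200, 220],
})

# Same grouped-mean summary the function builds before plotting
summary = df.groupby(["day", "promo"])["sales"].mean().reset_index()

# Monday lift: (150 - 100) / 100 * 100 = +50%
mon = summary[summary["day"] == "Mon"].set_index("promo")["sales"]
lift = (mon["Promo"] - mon["No Promo"]) / mon["No Promo"] * 100
print(f"Mon lift: {lift:+.1f}%")  # → Mon lift: +50.0%
```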

Top Performer Impact Analysis¶

In [12]:
def analyze_top_performers(df, group_col='store', target_col='sales', top_n=10, width=1200, height=500):
    """
    Analyze top performing entities (stores, days, months, etc.)
    
    Parameters:
    - df: DataFrame 
    - group_col: column to group by ('store', 'day', 'month', etc.)
    - target_col: metric to analyze ('sales' or 'customers')  
    - top_n: number of top performers to show
    """
    
    # Filter out closed stores (e.g., sales = 0) for unbiased and fair comparison
    df = df[df[target_col] > 0].copy()
    
    # Calculate averages and get top performers
    group_avg = df.groupby(group_col)[target_col].mean()
    top_performers = group_avg.nlargest(top_n)
    
    # Visualization
    fig = px.bar(
        x=top_performers.index.astype(str),
        y=top_performers.values,
        title=f'Top {top_n} {group_col.title()} by Average {target_col.title()}',
        labels={'x': group_col.title(), 'y': f'Average {target_col.title()}'}
    )
    fig.update_layout(title_x=0.5, height=height, width=width)
    fig.show(config={'displayModeBar': True, 'displaylogo': False})
    
    # Performance analysis
    print(f"Top {top_n} {group_col.title()} Performance Analysis:")
    print("=" * 55)
    print(f"{'Rank':<4} {group_col.title():<10} {'Average':<15} {'% of #1':<10}")
    print("-" * 55)
    
    currency = "€" if target_col == 'sales' else ""
    for i, (entity, avg_value) in enumerate(top_performers.items(), 1):
        pct_of_top = (avg_value / top_performers.iloc[0]) * 100
        print(f"{i:<4} {str(entity):<10} {currency}{avg_value:>9,.0f}     {pct_of_top:>6.1f}%")
    
    # Summary statistics
    total_entities = len(group_avg)
    top_avg = top_performers.mean()
    overall_avg = group_avg.mean()
    performance_gap = ((top_avg - overall_avg) / overall_avg) * 100
    
    print(f"\nSummary Statistics:")
    print(f"Total {group_col}s analyzed: {total_entities:,}")
    print(f"Top {top_n} average: {currency}{top_avg:,.0f}")
    print(f"Overall average: {currency}{overall_avg:,.0f}")
    print(f"Top {top_n} outperform by: {performance_gap:.1f}%")
    
    return top_performers
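The percent-of-leader column printed by `analyze_top_performers` boils down to `Series.nlargest` plus a ratio against the top entry; a minimal sketch with made-up store averages:

```python
import pandas as pd

# Made-up per-store averages (store id -> average sales)
group_avg = pd.Series({817: 21789, 262: 20684, 1114: 20576, 1: 5000})

# Top performers, then each expressed as a percentage of the leader
top = group_avg.nlargest(3)
pct_of_top = (top / top.iloc[0] * 100).round(1)
print(pct_of_top.tolist())  # → [100.0, 94.9, 94.4]
```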

Temporal Trends Impact Analysis¶

In [13]:
def analyze_temporal_trends(df, time_col='month', target_col='sales', width=1200, height=500):
    """
    Analyze trends over time periods (month, day, year, etc.)
    
    Parameters:
    - df: DataFrame
    - time_col: time dimension ('month', 'day', 'year', etc.)
    - target_col: metric to analyze ('sales' or 'customers')
    """
    # Filter out closed stores (e.g., sales = 0) for unbiased and fair comparison
    df = df[df[target_col] > 0].copy()

    # Calculate averages by time period
    time_stats = df.groupby(time_col).agg({
        target_col: ['mean', 'std', 'count']
    }).round(0)
    
    time_stats.columns = ['avg', 'std', 'count']
    time_stats = time_stats.reset_index().sort_values('avg', ascending=False)
    
    # Visualization
    fig = px.bar(
        time_stats,
        x=time_col,
        y='avg',
        title=f'{target_col.title()} Performance by {time_col.title()}',
        color='avg',
        color_continuous_scale='viridis'
    )
    fig.update_layout(title_x=0.5, height=height, width=width)
    fig.show(config={'displayModeBar': True, 'displaylogo': False})
    
    # Trend analysis
    print(f"{time_col.title()} Performance Analysis:")
    print("=" * 50)
    print(f"{'Rank':<4} {time_col.title():<12} {'Average':<15} {'Std Dev':<10} {'Count':<8}")
    print("-" * 60)
    
    for rank, (_, row) in enumerate(time_stats.iterrows(), 1):
        currency = "€" if target_col == 'sales' else ""
        print(f"{rank:<4} {str(row[time_col]):<12} {currency}{row['avg']:>9,.0f}     {currency}{row['std']:>6,.0f}   {row['count']:>6,.0f}")
    
    # Key insights
    best_period = time_stats.iloc[0][time_col]
    worst_period = time_stats.iloc[-1][time_col]
    best_value = time_stats.iloc[0]['avg']
    worst_value = time_stats.iloc[-1]['avg']
    volatility = ((best_value - worst_value) / time_stats['avg'].mean()) * 100
    
    currency = "€" if target_col == 'sales' else ""
    print(f"\nKey Insights:")
    print(f"Best {time_col}: {best_period} ({currency}{best_value:,.0f})")
    print(f"Worst {time_col}: {worst_period} ({currency}{worst_value:,.0f})")
    print(f"Performance range: {currency}{best_value - worst_value:,.0f}")
    print(f"Volatility: {volatility:.1f}%")
    
    return time_stats
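The multi-statistic aggregation at the heart of `analyze_temporal_trends` — `agg` with several statistics, then flattening the resulting MultiIndex columns — can be sketched in isolation (hypothetical two-month sample):

```python
import pandas as pd

# Hypothetical two-month sample
df = pd.DataFrame({
    "month": ["Jan", "Jan", "Feb", "Feb"],
    "sales": [100, 120, 200, 180],
})

# Multi-statistic aggregation, then flatten the MultiIndex columns
time_stats = df.groupby("month").agg({"sales": ["mean", "std", "count"]}).round(0)
time_stats.columns = ["avg", "std", "count"]
time_stats = time_stats.reset_index().sort_values("avg", ascending=False)
print(time_stats["avg"].tolist())  # → [190.0, 110.0]
```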

State Holiday Impact Analysis¶

In [14]:
def analyze_stateholiday_impact(df, target_col='sales', category_col='day', width=1200, height=500):
    """
    Analyze the impact of state holidays across a specified category (e.g., day, month, store).
    
    Parameters:
    - df: DataFrame with 'stateholiday' column already mapped to labels
    - target_col: metric to analyze ('sales' or 'customers')
    - category_col: grouping category ('day', 'month', 'store', etc.)
    """
    # Filter out closed stores (e.g., sales = 0) for unbiased and fair comparison
    df_open = df[df[target_col] > 0].copy()
    
    # Create summary data
    summary = df_open.groupby([category_col, 'stateholiday'])[target_col].mean().reset_index()
    
    # Visualization
    fig = px.bar(
        summary,
        x=category_col,
        y=target_col,
        color='stateholiday',
        title=f'State Holiday Impact: {target_col.title()} by {category_col.title()}',
        barmode='group'
    )
    fig.update_layout(title_x=0.5, height=height, width=width)
    fig.show(config={'displayModeBar': True, 'displaylogo': False})
    
    # Impact analysis
    print(f"State Holiday Impact Analysis - {target_col.title()} by {category_col.title()}:")
    print("=" * 70)
    
    # Get regular day baseline
    regular_avg = df_open[df_open['stateholiday'] == 'Normal Day'][target_col].mean()
    
    # Overall holiday impact
    print("Overall Holiday Impact:")
    print("-" * 25)
    currency = "€" if target_col == 'sales' else ""
    for holiday_label in ['Public', 'Easter', 'Christmas']:
        if holiday_label in df['stateholiday'].values:
            holiday_avg = df_open[df_open['stateholiday'] == holiday_label][target_col].mean()
            if not pd.isna(holiday_avg):
                impact = ((holiday_avg - regular_avg) / regular_avg) * 100
                print(f"{holiday_label:15}: {currency}{holiday_avg:6,.0f} ({impact:+5.1f}% vs regular)")

    print(f"Regular Days: {currency}{regular_avg:6,.0f} (baseline)")
    
    # Category-wise analysis
    print(f"\nHoliday Impact by {category_col.title()}:")
    print("-" * 35)
    
    categories = sorted(df[category_col].unique())
    for category in categories:
        regular = df_open[(df_open[category_col] == category) & (df_open['stateholiday'] == 'Normal Day')][target_col].mean()
        holiday = df_open[(df_open[category_col] == category) & (df_open['stateholiday'] != 'Normal Day')][target_col].mean()
        
        if not pd.isna(regular):
            currency = "€" if target_col == 'sales' else ""
            if not pd.isna(holiday):
                impact = ((holiday - regular) / regular) * 100
                print(f"{str(category):12}: Regular {currency}{regular:6,.0f} | Holiday {currency}{holiday:6,.0f} | Impact {impact:+5.1f}%")
            else:
                print(f"{str(category):12}: Regular {currency}{regular:6,.0f} | No holiday data")
    
    # Store closure analysis
    total_records = len(df)
    closed_stores = len(df[df[target_col] == 0])
    holiday_closures = len(df[(df['stateholiday'] != 'Normal Day') & (df[target_col] == 0)])
    regular_closures = len(df[(df['stateholiday'] == 'Normal Day') & (df[target_col] == 0)])
    
    print(f"\nStore Operations Impact:")
    print("-" * 25)
    print(f"Total store closures: {closed_stores:,} ({(closed_stores/total_records)*100:.1f}%)")
    print(f"Holiday closures: {holiday_closures:,}")
    print(f"Regular closures: {regular_closures:,}")
    
    return summary
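The baseline-relative impact figure reduces to a one-line percentage change; here is a sketch using the Public-holiday and Normal-Day averages reported in the analysis output (open stores only):

```python
# Averages as reported in the analysis output (open stores only)
regular_avg = 6967.0   # Normal Day baseline
holiday_avg = 8487.0   # Public holiday

# Percentage change relative to the baseline
impact = (holiday_avg - regular_avg) / regular_avg * 100
print(f"{impact:+.1f}% vs regular")  # → +21.8% vs regular
```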

RUNNING THE TOP 4 MOST IMPORTANT ANALYSES¶

In [15]:
# Example usage:
if __name__ == "__main__":
    # Assuming df_features is your dataset
    
    print("=== RUNNING TOP 4 MOST IMPORTANT ANALYSES ===\n")
    
    # 1. Promotion Impact Analysis
    print("1. PROMOTION IMPACT ANALYSIS")
    print("-" * 40)
    promo_results = analyze_promotion_impact(df_features, target_col='sales', category_col='day')
    
    print("\n" + "="*80 + "\n")
    
    # 2. State Holiday Impact Analysis
    print("2. STATE HOLIDAY IMPACT ANALYSIS")
    print("-" * 40)
    holiday_results = analyze_stateholiday_impact(df_features, target_col='sales', category_col='day')
    
    print("\n" + "="*80 + "\n")
    
    # 3. Top Store Performance Analysis  
    print("3. TOP STORE PERFORMANCE ANALYSIS")
    print("-" * 40)
    store_results = analyze_top_performers(df_features, group_col='store', target_col='sales', top_n=10)
    
    print("\n" + "="*80 + "\n")
    
    # 4. Temporal Trends Analysis
    print("4. TEMPORAL TRENDS ANALYSIS")
    print("-" * 40)
    trend_results = analyze_temporal_trends(df_features, time_col='month', target_col='sales')
=== RUNNING TOP 4 MOST IMPORTANT ANALYSES ===

1. PROMOTION IMPACT ANALYSIS
----------------------------------------
Promotion Impact Analysis - Sales by Day:
============================================================
Fri         : No Promo € 6,352 | Promo € 7,734 | Lift +21.8%
Mon         : No Promo € 6,205 | Promo € 9,776 | Lift +57.5%
Thu         : No Promo € 5,737 | Promo € 7,595 | Lift +32.4%
Tue         : No Promo € 5,716 | Promo € 8,277 | Lift +44.8%
Wed         : No Promo € 5,610 | Promo € 7,678 | Lift +36.9%

Overall Impact:
Average lift from promotions: +38.8%
Additional revenue per day: €2,302

================================================================================

2. STATE HOLIDAY IMPACT ANALYSIS
----------------------------------------
State Holiday Impact Analysis - Sales by Day:
======================================================================
Overall Holiday Impact:
-------------------------
Public         : € 8,487 (+21.8% vs regular)
Easter         : € 9,888 (+41.9% vs regular)
Christmas      : € 9,744 (+39.9% vs regular)
Regular Days: € 6,967 (baseline)

Holiday Impact by Day:
-----------------------------------
Fri         : Regular € 7,080 | Holiday € 6,879 | Impact  -2.8%
Mon         : Regular € 8,231 | Holiday € 8,049 | Impact  -2.2%
Sat         : Regular € 5,895 | Holiday € 5,732 | Impact  -2.8%
Sun         : Regular € 8,191 | Holiday € 8,406 | Impact  +2.6%
Thu         : Regular € 6,783 | Holiday € 6,407 | Impact  -5.6%
Tue         : Regular € 7,103 | Holiday € 6,939 | Impact  -2.3%
Wed         : Regular € 6,738 | Holiday € 6,396 | Impact  -5.1%

Store Operations Impact:
-------------------------
Total store closures: 168,494 (17.1%)
Holiday closures: 39,994
Regular closures: 128,500

================================================================================

3. TOP STORE PERFORMANCE ANALYSIS
----------------------------------------
Top 10 Store Performance Analysis:
=======================================================
Rank Store      Average         % of #1   
-------------------------------------------------------
1    817        €   21,789      100.0%
2    262        €   20,684       94.9%
3    1114       €   20,576       94.4%
4    251        €   19,158       87.9%
5    842        €   18,472       84.8%
6    513        €   18,212       83.6%
7    562        €   17,985       82.5%
8    788        €   17,968       82.5%
9    383        €   17,293       79.4%
10   756        €   16,575       76.1%

Summary Statistics:
Total stores analyzed: 1,115
Top 10 average: €18,871
Overall average: €6,931
Top 10 outperform by: 172.3%

================================================================================

4. TEMPORAL TRENDS ANALYSIS
----------------------------------------
Month Performance Analysis:
==================================================
Rank Month        Average         Std Dev    Count   
------------------------------------------------------------
1    Dec          €    8,609     € 4,094   50,393
2    Nov          €    7,189     € 3,010   51,401
3    May          €    7,107     € 3,008   80,099
4    Apr          €    7,047     € 3,138   81,726
5    Jun          €    7,001     € 3,145   82,571
6    Mar          €    6,977     € 3,124   85,975
7    Jul          €    6,910     € 3,038   55,388
8    Aug          €    6,649     € 2,902   54,411
9    Oct          €    6,603     € 2,815   53,291
10   Feb          €    6,589     € 2,871   80,239
11   Jan          €    6,564     € 2,849   86,335
12   Sep          €    6,547     € 2,889   52,321

Key Insights:
Best month: Dec (€8,609)
Worst month: Sep (€6,547)
Performance range: €2,062
Volatility: 29.5%
In [16]:
print("✅ Data Visualization Impact Analysis completed.\n")
✅ Data Visualization Impact Analysis completed.

In [17]:
print("✅ Feature Engineering and Data Visualization (I) completed successfully!")
print(f"🗓️ Analysis Date: {bold_start}{pd.Timestamp.now().strftime('%Y-%m-%d %H:%M:%S')}{bold_end}")
✅ Feature Engineering and Data Visualization (I) completed successfully!
🗓️ Analysis Date: 2025-08-16 01:02:34

🌟 Advantages¶

  • Reusable across 'year', 'month', 'dayofweek', etc.

  • Easy to change aggregation type ('sum', 'median', etc.)

  • Consistent naming and sorting

  • Makes your code far more modular for dashboards or reporting
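A minimal sketch of what such a parameterized helper might look like (the function name `grouped_metric` and the toy data are hypothetical, for illustration only):

```python
import pandas as pd

def grouped_metric(df, group_col, target_col="sales", agg="mean"):
    """Hypothetical generic helper: swap `agg` for 'sum', 'median', etc."""
    return df.groupby(group_col)[target_col].agg(agg).sort_values(ascending=False)

# Toy data for illustration
df = pd.DataFrame({"year": [2013, 2013, 2014], "sales": [10, 20, 50]})
print(grouped_metric(df, "year", agg="sum").tolist())  # → [50, 30]
```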

Why Reusability Matters¶

  • 💡 Scalability: You can plug your functions into larger pipelines or production environments without rewrites.
  • 🛠️ Maintainability: A bug fix in one utility can instantly improve multiple workflows.
  • 🚀 Efficiency: Spend less time rewriting logic and more time interpreting results.

Why This Matters for Rossmann Store Sales¶

  • We’ll likely repeat the same aggregations or visualizations across hundreds of stores.
  • Promos, holidays, and weekday patterns demand consistent filtering and analysis.
  • Modular functions help you prototype insights fast, scale across stores, and iterate smoothly.

In [18]:
# End analysis
viz_impact_analysis_end = pd.Timestamp.now()
duration = viz_impact_analysis_end - viz_impact_analysis_begin

# Final summary print
print("\n📋 Features Engineering && Data Viz Summary")
print(f"🟢 Begin Date: {bold_start}{viz_impact_analysis_begin.strftime('%Y-%m-%d %H:%M:%S')}{bold_end}")
print(f"✅ End Date:   {bold_start}{viz_impact_analysis_end.strftime('%Y-%m-%d %H:%M:%S')}{bold_end}")
print(f"⏱️ Duration:   {bold_start}{str(duration)}{bold_end}")
📋 Feature Engineering & Data Viz Summary
🟢 Begin Date: 2025-08-16 01:02:27
✅ End Date:   2025-08-16 01:02:34
⏱️ Duration:   0 days 00:00:07.113379

Project Design Rationale: Notebook Separation¶

To promote clarity, maintainability, and scalability, the project adopts a modular notebook architecture: each phase of the data workflow is organized into its own notebook. This prevents excessive code from accumulating in a single notebook and makes it easier to debug, update, and collaborate across stages of the workflow. By isolating data-transformation logic from visual analysis, each notebook stays focused and purpose-driven, improving the overall efficiency and readability of the project.